Apply dimensionality reduction techniques to the Big Five and MBTI portions of the data and determine what components explain what personality traits.
Principal Component Analysis (PCA) is used for the dimensionality reduction in this section.
library(readxl)      # read_excel
library(tidyverse)   # drop_na, mutate, group_by, summarise
library(factoextra)  # fviz_eig, fviz_pca_*, fviz_nbclust, fviz_gap_stat
library(dendextend)  # color_branches
library(cluster)     # clusGap
rawdata <- read_excel("/Users/lasgalen/Desktop/BDS 516/data in class/personality_test_data.xlsx", col_names = TRUE)
data_clean <- rawdata[-1,-55] # column 55 duplicates column 54, so drop column 55
data_clean <- as.data.frame(sapply(data_clean, as.numeric))
In this data set, the first 50 questions belong to the Big Five test and the following 70 questions to the MBTI test.
References: Big Five: https://canvas.upenn.edu/courses/1567977/files/95666049?module_item_id=20534870 ; MBTI: https://canvas.upenn.edu/courses/1567977/files/95666070?module_item_id=20534871
bigfive <- data_clean[,c(1:50)]
mbti <- data_clean[,c(51:120)]
pr.out_bigfive <- prcomp(drop_na(bigfive), scale = TRUE, center = TRUE)
summary(pr.out_bigfive)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.6817 2.3170 2.08142 2.07554 1.75867 1.44765 1.41390
## Proportion of Variance 0.1438 0.1074 0.08665 0.08616 0.06186 0.04191 0.03998
## Cumulative Proportion 0.1438 0.2512 0.33785 0.42401 0.48586 0.52778 0.56776
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.24843 1.19557 1.16879 1.14573 1.04761 1.02021 0.99600
## Proportion of Variance 0.03117 0.02859 0.02732 0.02625 0.02195 0.02082 0.01984
## Cumulative Proportion 0.59893 0.62752 0.65484 0.68109 0.70304 0.72386 0.74370
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.97806 0.92315 0.90128 0.89583 0.85095 0.82386 0.80490
## Proportion of Variance 0.01913 0.01704 0.01625 0.01605 0.01448 0.01357 0.01296
## Cumulative Proportion 0.76283 0.77988 0.79612 0.81217 0.82666 0.84023 0.85319
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.77209 0.74759 0.73366 0.70329 0.66675 0.62017 0.60793
## Proportion of Variance 0.01192 0.01118 0.01077 0.00989 0.00889 0.00769 0.00739
## Cumulative Proportion 0.86511 0.87629 0.88705 0.89695 0.90584 0.91353 0.92092
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.5916 0.57781 0.56853 0.55792 0.51920 0.50044 0.49263
## Proportion of Variance 0.0070 0.00668 0.00646 0.00623 0.00539 0.00501 0.00485
## Cumulative Proportion 0.9279 0.93460 0.94106 0.94729 0.95268 0.95769 0.96254
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.47609 0.44500 0.43142 0.41390 0.40576 0.37883 0.36600
## Proportion of Variance 0.00453 0.00396 0.00372 0.00343 0.00329 0.00287 0.00268
## Cumulative Proportion 0.96707 0.97104 0.97476 0.97818 0.98148 0.98435 0.98703
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.34856 0.3390 0.30428 0.28758 0.26839 0.23885 0.23761
## Proportion of Variance 0.00243 0.0023 0.00185 0.00165 0.00144 0.00114 0.00113
## Cumulative Proportion 0.98946 0.9918 0.99361 0.99526 0.99670 0.99784 0.99897
## PC50
## Standard deviation 0.22683
## Proportion of Variance 0.00103
## Cumulative Proportion 1.00000
biplot(pr.out_bigfive)
pr.var_bigfive <- pr.out_bigfive$sdev^2
pve_bigfive <- pr.var_bigfive / sum(pr.var_bigfive)
plot(pve_bigfive, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
plot(cumsum(pve_bigfive), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
fviz_eig(pr.out_bigfive)
fviz_pca_var(pr.out_bigfive,
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
From the PCA results and the variable plots above, Q24, Q34, and Q12 contribute most strongly to the leading components, while Q32, Q37, Q42, and Q53 also play important roles in separating personality types.
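The contribution ranking behind plots like `fviz_pca_var(..., col.var = "contrib")` can be reproduced directly from the loading matrix: because `prcomp()` returns unit-norm loadings, an item's percentage contribution to a component is simply its squared loading times 100. A minimal sketch on simulated Likert-style data (a stand-in for the survey matrix; with the real fit the same lines would use `pr.out_bigfive`, and `factoextra::get_pca_var()` computes the same quantity):

```r
# Sketch: rank items by their contribution to PC1.
# Simulated 1-5 Likert responses stand in for the survey data.
set.seed(1)
sim <- matrix(sample(1:5, 200 * 10, replace = TRUE), ncol = 10,
              dimnames = list(NULL, paste0("Q", 1:10)))
pr <- prcomp(sim, scale = TRUE, center = TRUE)
# prcomp() loadings are unit-norm, so contribution (%) = 100 * loading^2
contrib <- 100 * pr$rotation^2
head(sort(contrib[, "PC1"], decreasing = TRUE), 3)  # top three items on PC1
```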
pr.out_mbti <- prcomp(drop_na(mbti), scale = TRUE, center = TRUE)
summary(pr.out_mbti)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.8966 2.24527 1.88836 1.83641 1.79113 1.65255 1.58786
## Proportion of Variance 0.1199 0.07202 0.05094 0.04818 0.04583 0.03901 0.03602
## Cumulative Proportion 0.1199 0.19188 0.24282 0.29100 0.33683 0.37584 0.41186
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.54759 1.50800 1.43688 1.38622 1.33399 1.30985 1.27741
## Proportion of Variance 0.03421 0.03249 0.02949 0.02745 0.02542 0.02451 0.02331
## Cumulative Proportion 0.44608 0.47856 0.50806 0.53551 0.56093 0.58544 0.60875
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 1.24054 1.23067 1.18504 1.14823 1.11837 1.10746 1.08663
## Proportion of Variance 0.02198 0.02164 0.02006 0.01883 0.01787 0.01752 0.01687
## Cumulative Proportion 0.63074 0.65237 0.67244 0.69127 0.70914 0.72666 0.74353
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 1.06343 1.04221 1.02443 0.9862 0.97246 0.9241 0.92258
## Proportion of Variance 0.01616 0.01552 0.01499 0.0139 0.01351 0.0122 0.01216
## Cumulative Proportion 0.75968 0.77520 0.79019 0.8041 0.81760 0.8298 0.84195
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.89541 0.8776 0.83884 0.81739 0.80859 0.78296 0.77422
## Proportion of Variance 0.01145 0.0110 0.01005 0.00954 0.00934 0.00876 0.00856
## Cumulative Proportion 0.85341 0.8644 0.87446 0.88401 0.89335 0.90210 0.91067
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.73858 0.69638 0.6744 0.66679 0.64162 0.59865 0.59339
## Proportion of Variance 0.00779 0.00693 0.0065 0.00635 0.00588 0.00512 0.00503
## Cumulative Proportion 0.91846 0.92539 0.9319 0.93824 0.94412 0.94924 0.95427
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.57636 0.53445 0.51996 0.49989 0.47643 0.4735 0.45927
## Proportion of Variance 0.00475 0.00408 0.00386 0.00357 0.00324 0.0032 0.00301
## Cumulative Proportion 0.95901 0.96309 0.96696 0.97052 0.97377 0.9770 0.97998
## PC50 PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 0.44105 0.42140 0.38547 0.3649 0.33534 0.31390 0.30850
## Proportion of Variance 0.00278 0.00254 0.00212 0.0019 0.00161 0.00141 0.00136
## Cumulative Proportion 0.98276 0.98530 0.98742 0.9893 0.99093 0.99234 0.99370
## PC57 PC58 PC59 PC60 PC61 PC62 PC63
## Standard deviation 0.29327 0.26975 0.25016 0.23457 0.19745 0.17125 0.16369
## Proportion of Variance 0.00123 0.00104 0.00089 0.00079 0.00056 0.00042 0.00038
## Cumulative Proportion 0.99493 0.99597 0.99686 0.99765 0.99820 0.99862 0.99900
## PC64 PC65 PC66 PC67 PC68 PC69 PC70
## Standard deviation 0.13934 0.12981 0.12433 0.09669 0.06997 0.05290 0.03003
## Proportion of Variance 0.00028 0.00024 0.00022 0.00013 0.00007 0.00004 0.00001
## Cumulative Proportion 0.99928 0.99952 0.99974 0.99988 0.99995 0.99999 1.00000
biplot(pr.out_mbti)
pr.var_mbti <- pr.out_mbti$sdev^2
pve_mbti <- pr.var_mbti / sum(pr.var_mbti)
plot(pve_mbti, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
plot(cumsum(pve_mbti), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
fviz_eig(pr.out_mbti)
fviz_pca_ind(pr.out_mbti,
col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
fviz_pca_var(pr.out_mbti,
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
## Warning: ggrepel: 38 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
fviz_pca_biplot(pr.out_mbti, repel = TRUE,
col.var = "#2E9FDF",
col.ind = "#696969"
)
## Warning: ggrepel: 16 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
From the PCA results and the plots above, Q56, Q97, and Q109 contribute most strongly to the leading components and thus carry the most weight in determining personality type.
Show whether individuals cluster according to the Big Five and MBTI questions.
#PCA
fviz_pca_ind(pr.out_bigfive,
col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
set.seed(123)
# k-means
elbow_method_bigfive <- fviz_nbclust(drop_na(bigfive), FUNcluster = kmeans, method = "wss")
elbow_method_bigfive
silhouette_method_bigfive <- fviz_nbclust(drop_na(bigfive), FUNcluster = kmeans, method = "silhouette")
silhouette_method_bigfive
# The optimal number of clusters is 2
km.out_bigfive <- kmeans(drop_na(bigfive),2,nstart=20)
km.out_bigfive
## K-means clustering with 2 clusters of sizes 29, 59
##
## Cluster means:
## Q1 Q4 Q6 Q7 Q8 Q9 Q10 Q11
## 1 2.827586 2.034483 3.551724 4.068966 3.413793 2.310345 4.137931 2.724138
## 2 3.000000 1.593220 3.745763 2.728814 3.779661 2.440678 4.593220 2.779661
## Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19
## 1 2.172414 2.000000 3.344828 2.310345 4.206897 4.586207 4.137931 2.827586
## 2 3.440678 1.779661 4.203390 1.576271 4.203390 3.694915 4.016949 2.559322
## Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27
## 1 4.103448 2.793103 1.965517 1.586207 3.448276 2.000000 2.862069 3.275862
## 2 4.542373 2.000000 3.169492 1.593220 4.203390 2.067797 3.050847 2.423729
## Q28 Q29 Q30 Q31 Q32 Q33 Q34 Q35
## 1 3.862069 2.206897 3.931034 2.655172 4.034483 1.862069 2.896552 2.206897
## 2 4.033898 1.915254 3.949153 2.457627 1.932203 1.644068 3.457627 1.508475
## Q36 Q37 Q38 Q39 Q40 Q41 Q42 Q43
## 1 4.068966 3.896552 3.931034 3.482759 3.724138 2.172414 3.482759 3.103448
## 2 3.966102 2.288136 4.406780 3.474576 4.169492 1.677966 1.677966 2.932203
## Q45 Q46 Q47 Q48 Q49 Q50 Q51 Q52
## 1 2.965517 4.034483 3.724138 3.758621 4.586207 3.137931 3.896552 3.965517
## 2 3.084746 4.372881 3.711864 2.305085 4.406780 3.016949 4.288136 3.830508
## Q53 Q54
## 1 3.551724 4.034483
## 2 1.796610 4.271186
##
## Clustering vector:
## [1] 2 2 2 2 1 2 2 1 2 1 1 1 2 1 2 2 1 2 1 2 1 1 2 2 2 1 1 2 1 2 2 2 2 2 2 2 2 2
## [39] 2 2 1 1 2 2 2 2 1 1 2 2 2 2 2 1 2 1 2 1 2 2 1 2 2 1 2 2 1 2 1 1 2 2 2 2 2 2
## [77] 2 1 2 2 2 1 2 2 2 1 2 1
##
## Within cluster sum of squares by cluster:
## [1] 1532.069 3109.356
## (between_SS / total_SS = 10.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Hierarchical Clustering
hccomplete_bigfive <- hclust(dist(drop_na(bigfive)), method="complete")
dend_bigfive <- as.dendrogram(hccomplete_bigfive)
dend.bigfive <- color_branches(dend_bigfive, k = 2)
plot(dend_bigfive)
bigfiveclusters <- cutree(hccomplete_bigfive, k = 2)
# Gap Statistic Method
gap_stat <- clusGap(drop_na(bigfive), FUN = kmeans, nstart = 25, K.max = 10, B = 50)
gap_stat_method <- fviz_gap_stat(gap_stat)
gap_stat_method
From the plot of individuals from the PCA on the Big Five questions, there are no clearly separated clusters, although individuals tend to gather on the negative side of Dim1.
We then ran k-means clustering, choosing k with the elbow and silhouette methods. While there is no pronounced elbow, the silhouette method suggests two clusters, so we ran k-means with k = 2; the resulting clusters have sizes 29 and 59.
After running hierarchical clustering, we colored the dendrogram with two colors and cut it into two clusters.
The gap statistic method suggests a single cluster is optimal; we suspect this is due to the small sample size.
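One way to check how well the k-means and hierarchical solutions agree is to cross-tabulate their labels (on the real data, `table(km.out_bigfive$cluster, bigfiveclusters)`; note the numeric labels themselves are arbitrary). A minimal sketch with two simulated, well-separated groups standing in for `drop_na(bigfive)`:

```r
# Sketch: cross-tabulate two clusterings to see how well they agree.
# Two well-separated simulated groups stand in for the survey data.
set.seed(123)
sim <- rbind(matrix(rnorm(150, mean = 0), ncol = 5),
             matrix(rnorm(150, mean = 5), ncol = 5))
km <- kmeans(sim, centers = 2, nstart = 20)
hc <- cutree(hclust(dist(sim), method = "complete"), k = 2)
table(kmeans = km$cluster, hclust = hc)  # off-diagonal mass = disagreement
```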
#PCA
fviz_pca_ind(pr.out_mbti,
col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
# k-means
elbow_method_mbti <- fviz_nbclust(drop_na(mbti), FUNcluster = kmeans, method = "wss")
elbow_method_mbti
silhouette_method_mbti <- fviz_nbclust(drop_na(mbti), FUNcluster = kmeans, method = "silhouette")
silhouette_method_mbti
# The optimal number of clusters is 2
km.out_mbti <- kmeans(drop_na(mbti),2,nstart=20)
km.out_mbti
## K-means clustering with 2 clusters of sizes 27, 50
##
## Cluster means:
## Q55 Q56 Q57 Q58 Q60 Q61 Q62 Q63
## 1 1.37037 1.62963 1.888889 1.444444 1.296296 1.222222 1.481481 1.296296
## 2 1.70000 1.14000 1.720000 1.140000 1.320000 1.260000 1.180000 1.520000
## Q64 Q65 Q66 Q67 Q68 Q69 Q70 Q71
## 1 1.592593 1.777778 1.888889 1.851852 1.555556 1.111111 1.37037 1.888889
## 2 1.420000 1.740000 1.760000 1.480000 1.340000 1.000000 1.46000 1.560000
## Q72 Q73 Q74 Q75 Q76 Q77 Q78 Q79
## 1 1.62963 1.666667 1.666667 1.148148 1.740741 1.407407 1.592593 1.851852
## 2 1.44000 1.540000 1.180000 1.040000 1.460000 1.480000 1.460000 1.700000
## Q80 Q81 Q82 Q83 Q84 Q85 Q86 Q87
## 1 1.666667 1.407407 1.444444 1.777778 1.111111 1.62963 1.777778 1.666667
## 2 1.480000 1.280000 1.080000 1.780000 1.420000 1.54000 1.800000 1.200000
## Q88 Q89 Q90 Q91 Q92 Q93 Q94 Q95
## 1 1.814815 1.962963 1.888889 1.074074 1.481481 1.703704 1.222222 1.518519
## 2 1.500000 1.740000 1.820000 1.340000 1.160000 1.520000 1.320000 1.260000
## Q96 Q97 Q98 Q99 Q100 Q101 Q102 Q103
## 1 1.518519 1.777778 1.962963 1.666667 1.925926 1.444444 1.592593 1.851852
## 2 1.320000 1.240000 1.880000 1.440000 1.800000 1.200000 1.660000 1.460000
## Q104 Q105 Q106 Q107 Q108 Q109 Q110 Q111 Q112
## 1 1.185185 1.222222 1.555556 1.518519 1.37037 1.62963 1.62963 1.481481 1.777778
## 2 1.060000 1.680000 1.200000 1.260000 1.10000 1.44000 1.14000 1.260000 1.600000
## Q113 Q114 Q115 Q116 Q117 Q118 Q119 Q120
## 1 1.592593 1.592593 1.666667 1.777778 1.518519 1.666667 1.148148 1.703704
## 2 1.180000 1.160000 1.400000 1.640000 1.160000 1.220000 1.320000 1.360000
## Q121 Q122 Q123 Q124 Q125
## 1 1.777778 1.518519 1.444444 1.666667 1.740741
## 2 1.640000 1.260000 1.280000 1.240000 1.120000
##
## Clustering vector:
## [1] 1 2 1 1 2 1 2 2 2 1 2 2 2 1 1 2 2 2 2 1 2 2 2 2 1 2 1 1 2 2 1 1 2 2 1 2 2 2
## [39] 2 2 1 1 2 1 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 1 2 2 2 1 1 1 1 2 2 1 2 1 2 1
## [77] 2
##
## Within cluster sum of squares by cluster:
## [1] 364.2963 664.0000
## (between_SS / total_SS = 8.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Hierarchical Clustering
hccomplete_mbti <- hclust(dist(drop_na(mbti)), method="complete")
dend_mbti <- as.dendrogram(hccomplete_mbti)
dend.mbti <- color_branches(dend_mbti, k = 2)
plot(dend_mbti)
mbticlusters <- cutree(hccomplete_mbti, k = 2)
# Gap Statistic Method
gap_stat <- clusGap(drop_na(mbti), FUN = kmeans, nstart = 25, K.max = 10, B = 50)
gap_stat_method <- fviz_gap_stat(gap_stat)
gap_stat_method
From the plot of individuals from the PCA on the MBTI questions, there are again no clearly separated clusters, although individuals tend to gather in the middle (low cos2 values, meaning these individuals are poorly represented by Dim1 and Dim2).
We then ran k-means clustering, choosing k with the elbow and silhouette methods. While there is again no pronounced elbow, the silhouette method once more suggests two clusters, so we ran k-means with k = 2; the resulting clusters have sizes 27 and 50.
After running hierarchical clustering, we colored the dendrogram with two colors and cut it into two clusters.
The gap statistic method again suggests a single cluster is optimal; we suspect this is due to the small sample size.
####3.
Compare the results of the dimensionality reduction of Big 5 and MBTI.
fviz_pca_biplot(pr.out_bigfive, repel = TRUE,
col.var = "#2E9FDF",
col.ind = "#696969",
title = "PCA Biplot for Big Five"
)
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
fviz_pca_biplot(pr.out_mbti, repel = TRUE,
col.var = "#2E9FDF",
col.ind = "#696969",
title = "PCA Biplot for MBTI"
)
## Warning: ggrepel: 16 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
The PCA biplots of the Big Five and MBTI tests show that most Big Five variables load strongly on Dim1 in both directions (positive and negative), whereas MBTI variables load mainly on the negative side of Dim1.
Also, individuals in the MBTI test are relatively more dispersed than individuals in the Big Five test.
PCA is particularly useful when the variables in a data set are highly correlated: correlation indicates redundancy, and PCA exploits that redundancy to compress the original variables into a smaller number of new variables (principal components) that explain most of the variance. Given the patterns PCA reveals in the two tests, the Big Five data may have higher redundancy.
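The redundancy claim can be checked directly: the mean absolute off-diagonal correlation summarizes how much shared variance PCA has to compress. A minimal sketch on simulated data; `mean_abs_cor` is a helper defined here for illustration, and on the real data one would compare `mean_abs_cor(drop_na(bigfive))` with `mean_abs_cor(drop_na(mbti))`:

```r
# Sketch: measure redundancy as the mean absolute off-diagonal correlation.
# `mean_abs_cor` is a helper defined here, not a library function.
mean_abs_cor <- function(x) {
  r <- cor(x)
  mean(abs(r[lower.tri(r)]))
}
set.seed(42)
loadings <- matrix(rnorm(10), nrow = 2)            # 2 latent factors, 5 items
redundant <- matrix(rnorm(100), ncol = 2) %*% loadings +
  matrix(rnorm(250, sd = 0.3), ncol = 5)           # items share latent variance
independent <- matrix(rnorm(250), ncol = 5)        # no shared structure
mean_abs_cor(redundant)    # high: PCA compresses this well
mean_abs_cor(independent)  # low: little for PCA to compress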
####4.
Check to see if there is correspondence between MBTI and Big Five items.
data_clean3 <- drop_na(data_clean) # no need to pre-scale: prcomp() below already centers and scales
pr.out_data_clean3 <- prcomp(data_clean3, scale = TRUE, center = TRUE)
summary(pr.out_data_clean3)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.46362 2.90639 2.88738 2.69513 2.18728 1.94777 1.8909
## Proportion of Variance 0.09997 0.07039 0.06947 0.06053 0.03987 0.03162 0.0298
## Cumulative Proportion 0.09997 0.17036 0.23984 0.30037 0.34024 0.37185 0.4017
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.85466 1.82770 1.78015 1.7076 1.67318 1.61901 1.59684
## Proportion of Variance 0.02866 0.02784 0.02641 0.0243 0.02333 0.02184 0.02125
## Cumulative Proportion 0.43031 0.45815 0.48456 0.5089 0.53219 0.55403 0.57528
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 1.55494 1.52513 1.47649 1.428 1.36780 1.34882 1.34635
## Proportion of Variance 0.02015 0.01938 0.01817 0.017 0.01559 0.01516 0.01511
## Cumulative Proportion 0.59543 0.61481 0.63298 0.650 0.66557 0.68073 0.69584
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 1.33381 1.32231 1.27530 1.25594 1.22916 1.19432 1.17149
## Proportion of Variance 0.01483 0.01457 0.01355 0.01314 0.01259 0.01189 0.01144
## Cumulative Proportion 0.71066 0.72523 0.73879 0.75193 0.76452 0.77641 0.78784
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 1.16421 1.14647 1.12819 1.09700 1.06475 1.04389 1.01927
## Proportion of Variance 0.01129 0.01095 0.01061 0.01003 0.00945 0.00908 0.00866
## Cumulative Proportion 0.79914 0.81009 0.82070 0.83073 0.84017 0.84925 0.85791
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 1.0043 0.9918 0.96271 0.95582 0.93475 0.90534 0.86840
## Proportion of Variance 0.0084 0.0082 0.00772 0.00761 0.00728 0.00683 0.00628
## Cumulative Proportion 0.8663 0.8745 0.88224 0.88985 0.89713 0.90396 0.91025
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.84643 0.83824 0.82126 0.77233 0.76760 0.73332 0.7182
## Proportion of Variance 0.00597 0.00586 0.00562 0.00497 0.00491 0.00448 0.0043
## Cumulative Proportion 0.91622 0.92207 0.92769 0.93266 0.93757 0.94206 0.9464
## PC50 PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 0.69091 0.67419 0.67151 0.64043 0.6295 0.61496 0.60700
## Proportion of Variance 0.00398 0.00379 0.00376 0.00342 0.0033 0.00315 0.00307
## Cumulative Proportion 0.95033 0.95412 0.95788 0.96130 0.9646 0.96775 0.97082
## PC57 PC58 PC59 PC60 PC61 PC62 PC63
## Standard deviation 0.56752 0.56432 0.5591 0.53054 0.50725 0.48691 0.47846
## Proportion of Variance 0.00268 0.00265 0.0026 0.00235 0.00214 0.00198 0.00191
## Cumulative Proportion 0.97350 0.97616 0.9788 0.98111 0.98325 0.98523 0.98714
## PC64 PC65 PC66 PC67 PC68 PC69 PC70
## Standard deviation 0.46689 0.43168 0.42076 0.40039 0.37667 0.36031 0.35412
## Proportion of Variance 0.00182 0.00155 0.00148 0.00134 0.00118 0.00108 0.00105
## Cumulative Proportion 0.98895 0.99051 0.99198 0.99332 0.99450 0.99558 0.99663
## PC71 PC72 PC73 PC74 PC75 PC76
## Standard deviation 0.33075 0.31150 0.29128 0.24175 0.23497 1.538e-15
## Proportion of Variance 0.00091 0.00081 0.00071 0.00049 0.00046 0.000e+00
## Cumulative Proportion 0.99754 0.99835 0.99905 0.99954 1.00000 1.000e+00
biplot(pr.out_data_clean3)
pr.var_data_clean3 <- pr.out_data_clean3$sdev^2
pve_data_clean3 <- pr.var_data_clean3 / sum(pr.var_data_clean3)
plot(pve_data_clean3, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
plot(cumsum(pve_data_clean3), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
fviz_eig(pr.out_data_clean3)
fviz_pca_var(pr.out_data_clean3,
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
## Warning: ggrepel: 85 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Question 35 (“I am not really interested in others”) and question 55 (whether it is easy to speak to strangers) sit very close to each other, and the questions around them also relate to how comfortable one feels around people. On inspection, both turn out to be questions about extraversion/introversion.
Questions 20, 30, and 40 have substantial loadings in the same direction. From their wording, they appear to measure “agreeableness” (the degree to which an individual adjusts their behavior to suit others). Questions 73 and 80 from the MBTI portion fall in the same quadrant and similarly measure how much one values harmonious relationships. We therefore think this quadrant corresponds to the concept of “agreeableness.”
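Whether a set of items “points in the same direction” can be read off the rotation (loading) matrix: same-sign loadings on a dimension mean the items vary together along it. A minimal sketch on simulated data; with the real fit the equivalent line would be `pr.out_data_clean3$rotation[c("Q20", "Q30", "Q40", "Q73", "Q80"), 1:2]`:

```r
# Sketch: pull specific items' loadings on Dim1/Dim2 from the rotation matrix.
# Simulated Likert responses stand in for the combined questionnaire.
set.seed(7)
sim <- matrix(sample(1:5, 300, replace = TRUE), ncol = 6,
              dimnames = list(NULL, paste0("Q", 1:6)))
pr <- prcomp(sim, scale = TRUE, center = TRUE)
pr$rotation[c("Q2", "Q4"), 1:2]  # same-sign entries load in the same direction
```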
####5.
Predict my MBTI personality using an appropriate method and compare it with my actual score
data_clean <- drop_na(data_clean)
elbow_method <- fviz_nbclust(data_clean, FUNcluster = kmeans, method = "wss")
elbow_method
silhouette_method <- fviz_nbclust(data_clean, FUNcluster = kmeans, method = "silhouette")
silhouette_method
According to the elbow and silhouette analyses, the optimal number of clusters is 3.
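The average silhouette width that `fviz_nbclust()` plots can also be computed directly with `cluster::silhouette()`. A minimal sketch on simulated data (on the real data, `drop_na(data_clean)` would replace `sim`):

```r
# Sketch: average silhouette width for k = 2..5, the statistic behind
# the silhouette plot above. Simulated data stands in for data_clean.
library(cluster)  # silhouette()
set.seed(123)
sim <- matrix(rnorm(300), ncol = 3)
avg_sil <- sapply(2:5, function(k) {
  km <- kmeans(sim, centers = k, nstart = 20)
  mean(silhouette(km$cluster, dist(sim))[, "sil_width"])
})
names(avg_sil) <- 2:5
avg_sil  # pick the k with the largest average width
```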
# The optimal number of cluster is 3
# Use Hierarchical Clustering to identify the clusters
hccomplete <- hclust(dist(drop_na(data_clean)), method="complete")
dend <- as.dendrogram(hccomplete)
dend <- color_branches(dend, k = 3)
plot(dend)
clusters <- cutree(hccomplete, k = 3)
data_clean %>% mutate(cluster = clusters) %>%
group_by(cluster) %>%
summarise_all(~ mean(., na.rm = TRUE))
## # A tibble: 3 x 121
## cluster Q1 Q4 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2.82 1.76 3.87 2.8 3.78 2.51 4.58 2.58 3.29 1.87 4.27
## 2 2 2.92 2 3.62 4.12 3.65 2.15 4.12 2.58 2.19 1.92 3.46
## 3 3 3.8 1.2 2 2.4 4.6 3 4.8 4.6 3.8 1.6 4
## # … with 109 more variables: Q15 <dbl>, Q16 <dbl>, Q17 <dbl>, Q18 <dbl>,
## # Q19 <dbl>, Q20 <dbl>, Q21 <dbl>, Q22 <dbl>, Q23 <dbl>, Q24 <dbl>,
## # Q25 <dbl>, Q26 <dbl>, Q27 <dbl>, Q28 <dbl>, Q29 <dbl>, Q30 <dbl>,
## # Q31 <dbl>, Q32 <dbl>, Q33 <dbl>, Q34 <dbl>, Q35 <dbl>, Q36 <dbl>,
## # Q37 <dbl>, Q38 <dbl>, Q39 <dbl>, Q40 <dbl>, Q41 <dbl>, Q42 <dbl>,
## # Q43 <dbl>, Q45 <dbl>, Q46 <dbl>, Q47 <dbl>, Q48 <dbl>, Q49 <dbl>,
## # Q50 <dbl>, Q51 <dbl>, Q52 <dbl>, Q53 <dbl>, Q54 <dbl>, Q55 <dbl>,
## # Q56 <dbl>, Q57 <dbl>, Q58 <dbl>, Q60 <dbl>, Q61 <dbl>, Q62 <dbl>,
## # Q63 <dbl>, Q64 <dbl>, Q65 <dbl>, Q66 <dbl>, Q67 <dbl>, Q68 <dbl>,
## # Q69 <dbl>, Q70 <dbl>, Q71 <dbl>, Q72 <dbl>, Q73 <dbl>, Q74 <dbl>,
## # Q75 <dbl>, Q76 <dbl>, Q77 <dbl>, Q78 <dbl>, Q79 <dbl>, Q80 <dbl>,
## # Q81 <dbl>, Q82 <dbl>, Q83 <dbl>, Q84 <dbl>, Q85 <dbl>, Q86 <dbl>,
## # Q87 <dbl>, Q88 <dbl>, Q89 <dbl>, Q90 <dbl>, Q91 <dbl>, Q92 <dbl>,
## # Q93 <dbl>, Q94 <dbl>, Q95 <dbl>, Q96 <dbl>, Q97 <dbl>, Q98 <dbl>,
## # Q99 <dbl>, Q100 <dbl>, Q101 <dbl>, Q102 <dbl>, Q103 <dbl>, Q104 <dbl>,
## # Q105 <dbl>, Q106 <dbl>, Q107 <dbl>, Q108 <dbl>, Q109 <dbl>, Q110 <dbl>,
## # Q111 <dbl>, Q112 <dbl>, Q113 <dbl>, Q114 <dbl>, Q115 <dbl>, Q116 <dbl>, …
Comparing Alex’s scores on Q35, Q105, Q20, Q30, Q40, Q73, and Q80 with the cluster means, his answers are closest overall to cluster 3 (his answers to Q35 and Q80 are closer to cluster 2, his answers to Q105, Q20, Q30, and Q40 are closer to cluster 3, and his answer to Q73 is closer to cluster 1).
Q35 and Q105 identify introversion/extraversion; from the average scores on these two questions, we infer that people in cluster 2 tend to be introverted. Q20, Q30, and Q40 measure “agreeableness” (the degree to which an individual adjusts their behavior to suit others); people in cluster 3 tend to be more agreeable and soft-hearted.
Q73 and Q80 measure how much one values harmonious relationships; people in cluster 1 tend to value them most.
From both the Big Five and MBTI results, we infer that Alex is on the more introverted side and is an agreeable person. He may also value harmonious social relationships.
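The comparison above can be made explicit by assigning the respondent to the nearest k-means centroid by Euclidean distance. A minimal sketch on simulated data; `new_answers` is a hypothetical answer vector for illustration, not Alex’s actual responses:

```r
# Sketch: assign one new respondent to the nearest k-means centroid.
# `new_answers` is a hypothetical response vector, not real survey data.
set.seed(123)
sim <- matrix(sample(1:5, 500, replace = TRUE), ncol = 5)
km <- kmeans(sim, centers = 3, nstart = 20)
new_answers <- c(3, 4, 2, 5, 1)
d <- apply(km$centers, 1, function(ctr) sqrt(sum((new_answers - ctr)^2)))
which.min(d)  # index of the predicted cluster
```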